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1.  Introduction 


1.1  What  is  Multibase? 

The  database  approach  to  data  processing  requires 
that  all  of  the  data  relevant  to  an  enterprise  be  stored 
in  an  integrated  database.  By  "integrated",  we  mean  that  a 
single  schema  (i.e.,  database  description)  describes  the 
entire  database,  that  all  accesses  to  the  database  are 
expressed  relative  to  that  schema,  and  that  such  accesses 
are  processed  against  a  single  (logical)  copy  of  the  data¬ 
base.  Unfortunately,  in  the  real  world  many  databases  are 
not  integrated.  Often,  the  data  relevant  to  an  enterprise 
is  implemented  by  many  independent  databases,  each  with 
its  own  schema.  Such  databases  are  nonintegrated.  Furth¬ 
ermore,  these  databases  may  be  managed  by  different  data¬ 
base  management  systems  (DBMS) ,  perhaps  on  different 
hardware.  In  this  case,  the  databases  are  distributed  and 
heterogeneous,  in  addition  to  being  nonintegrated.  Thus, 
the  real  world  of  nonintegrated,  heterogeneous,  distri¬ 
buted  databases  differs  greatly  from  the  more  ideal  world 
of  an  integrated  database. 

Nonintegrated,  heterogeneous,  distributed  databases 
arise  for  several  reasons.  First,  many  of  these  databases 
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were  created  before  the  benefits  of  integrated  databases 
were  well  understood.  In  those  days,  total  integration 
was  not  a  principal  database  design  goal.  Second,  the 
lack  of  a  central  database  administrator  for  some  enter¬ 
prises  has  made  it  difficult  for  independent  organizations 
within  an  enterprise  to  produce  an  integrated  database 
suitable  for  all  of  them.  Third,  the  large  size  of  many 
data  processing  applications  has  made  distribution  a 
necessity,  simply  to  handle  the  volume  of  work.  Since 
integrated  distributed  DBMSs  have  not  been  available,  it 
has  been  necessary  to  implement  applications  on  different 
machines.  Since  different  applications  often  have  dif¬ 
ferent  performance  and  functionality  requirements,  dif¬ 
ferent  DBMSs  were  often  selected  to  run  on  these  machines 
to  meet  these  different  requirements.  Many  data  processing 
organizations  have  experienced  these  problems,  and  so 
there  are  many  nonintegrated,  heterogeneous,  distributed 
databases  in  the  world. 

A  principal  problem  in  using  databases  of  this  type 
is  that  of  integrated  retrieval.  In  such  databases,  each 
independent  database  has  its  own  schema,  expressed  in  its 
own  data  model,  and  can  be  accessed  only  by  its  own 
retrieval  language.  Since  different  databases  in  general 
have  different  schemata,  different  data  models,  and  dif¬ 
ferent  retrieval  languages,  many  difficulties  arise  in 
formulating  and  implementing  retrieval  requests  (called 
queries)  that  require  data  from  more  than  one  database. 
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These  difficulties  include:  resolving  incompatibilities 
between  the  databases  such  as:  differences  of  data  types 
and  conflicting  schema  names;  resolving  inconsistencies 
between  copies  of  the  same  information  stored  in  different 
databases;  and  transforming  a  query  expressed  in  the 
user's  language  into  a  set  of  queries  expressed  in  the 
many  different  languages  supported  by  the  different  sites. 
Implementing  such  a  query  usually  consumes  months  of  pro¬ 
gramming  time,  making  it  a  very  expensive  activity.  Some¬ 
times,  the  necessary  effort  is  so  great  that  implementing 
the  query  is  not  feasible  at  all. 

Multibase  is  a  software  system  that  helps  integrate 
nonintegrated,  heterogeneous,  distributed  databases.  Its 
main  goal  is  to  present  the  illusion  of  an  integrated 
database  to  users  without  requiring  that  the  database  be 
physically  integrated.  It  accomplishes  this  by  allowing 
users  to  view  the  database  through  a  single  global  schema 
and  by  allowing  them  to  access  the  data  using  a  high  level 
data  manipulation  language  (DML) .  Queries  posed  in  this 
language  are  entirely  processed  by  Multibase  as  if  the 
database  were  integrated,  homogeneous,  and  non- 
distributed.  Multibase  uses  the  Functional  Data  Model  to 
define  the  global  schema,  and  the  language  ADAPLEX  (DAPLEX 
imbeded  in  ADA)  as  the  high  level  DML.  The  functional 
data  model  and  ADAPLEX  are  discussed  in  Section  5. 
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1.2  Implementation  Objectives 

There  are  many  approaches  to  the  design  of  the  Multi¬ 
base  system.  In  deciding  which  approach  to  choose,  we 
begin  with  the  following  design  objectives. 

1)  Generality:  we  do  not  want  to  design  an  application 
specific  Multibase  system.  Instead,  we  want  to  provide 
powerful  generali2ed  tools  which  can  be  used  to 
integrate  various  database  systems  for  various  appli¬ 
cations  with  a  minimum  of  programming  effort. 

2)  Extendability :  we  want  a  design  which  allows  expansion 
of  functionality  without  major  modification.  There 
are  areas  in  the  Multibase  design  where  substantial 
research  effort  is  still  required,  and  so  we  must  be 
able  to  add  additional  features  to  the  Multibase  sys¬ 
tem  as  results  come  in. 

3)  Compatibility:  we  want  a  design  that  does  not  render 
existing  software  invalid,  because  this  represents  a 
very  large  investment.  Thus,  we  must  leave  the  exist¬ 
ing  interface  to  the  local  DBMS  intact. 

The  proposed  architecture  of  the  Multibase  system 
consists  of  two  basic  components:  a  schema  design  aid  and 
a  run-time  query  processing  subsystem.  The  schema  design 
aid  provides  tools  to  the  "integrated"  database  designer 
to  design  the  global  schema  and  to  define  a  mapping  from 
the  local  databases  to  the  global  schema.  The  run-time 
query  processing  subsystem  then  uses  the  mapping  defini- 
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tion  to  translate  global  queries  into  local  queries, 
ensuring  that  the  local  queries  are  executed  correctly  and 
efficiently  by  local  DBMSs.  The  schema  design  aid  is  dis¬ 
cussed  first. 

1.3  Schema  Architecture 

The  Multibase  architecture  has  three  levels  of  sche¬ 
mata,  a  global  schema  (GS)  at  the  top  level,  an  integra¬ 
tion  schema  (IS)  and  one  local  schema  (LS)  per  local  data¬ 
base  at  the  middle  level,  and  one  local  host  schema  (LHS) 
per  local  database  at  the  bottom  level.  These  components 
and  their  interrelationships  are  depicted  in  Figure  1.1. 

The  local  host  schemata  are  the  original  existing 
schemata  defined  in  local  data  models  and  used  by  the 
local  DBMSs.  For  example  they  can  be  relational,  file,  or 
Codasyl  schemata.  Each  of  these  LHSs  is  translated  into  a 
local  schema  (LS)  defined  in  the  Functional  Data  Model. 
By  expressing  the  LSs  in  a  single  data  model,  higher  lev¬ 
els  of  the  system  need  not  be  concerned  with  data  model 
differences  among  the  local  DBMSs.  In  addition,  there  is 
an  integration  schema  that  describes  a  database  containing 
information  needed  for  integrating  databases.  For  exam¬ 
ple,  suppose  one  database  records  the  speed  of  ships  in 
miles  per  hour,  while  the  other  records  it  in  kilometers 
per  hour.  To  integrate  these  two  databases,  we  need  infor- 
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Schema  Architecture 


Figure  1.1 


mation  about  the  mapping  between  these  two  scales.  This 
information  is  stored  in  the  integration  database. 
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The  LSs  and  IS  are  mapped,  via  a  view  mapping,  into 
the  global  schema  (GS) .  The  GS  allows  users  to  pose 
queries  that  appear  to  be  on  a  homogeneous  and  integrated 
database.  Roughly  speaking,  the  LHS  to  LS  mapping  pro¬ 
vides  homogeneity  and  the  LS  and  IS  to  GS  mapping  provides 
integration.  The  schema  design  aid  provides  tools  to  the 
database  designer  to  define  LSs,  the  GS,  and  the  mapping 
among  them  and  the  LHSs. 


1.4  Query  Processing  Architecture 

The  architecture  of  the  run-time  query  processing 
subsystem  consists  of  the  Multibase  software  and  local 
DBMSs.  These  components  and  their  interrelationships  are 
depicted  in  Figure  1.2.  The  users  submit  queries  over  the 
global  schema  (called  global  queries)  to  the  Multibase 
software  which  translates  them  into  subqueries  over  local 
schemata  (called  local  queries) .  These  local  queries  are 
then  sent  to  local  DBMSs  to  be  executed. 

Since  the  global  queries  are  posed  against  the  global 
schema  without  any  knowledge  of  the  distribution  of  the 
data  and  the  availability  of  "fast  access  paths",  the  Mul¬ 
tibase  software  must  optimize  queries  so  they  can  be  exe¬ 
cuted  efficiently.  In  addition,  the  translation  process 
must  also  be  correct;  that  is,  the  local  queries  must 
retrieve  exactly  the  same  information  which  the  original 
global  query  requests. 
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Run-Time  Query  Processing  Subsystem 


Figure  1.2 


global  queries 
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1.5  Meeting  the  Objectives 

The  proposed  architecture  meets  the  objective  of  gen¬ 
erality.  The  only  component  of  the  Multibase  system  which 
is  customized  for  the  application  is  the  global  schema  and 
its  mapping  definition  to  the  local  schemata.  The  only 
component  of  Multibase  that  is  customized  for  the  local 
DBMSs  is  the  interface  software  that  allows  Multibase  to 
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communicate  with  the  heterogeneous  DBMSs  in  a  single 
language.  These  are  only  small  components  of  the  Multi¬ 
base  system.  Thus,  most  of  Multibase  is  neither 
application-specific  nor  DBMS-specific.  Multibase  also 
meets  the  objective  of  compatibility,  because  local  data¬ 
bases  are  not  modified,  therefore,  existing  application 
programs  can  still  access  local  databases  through  local 
DBMSs.  And  as  the  details  of  the  architecture  are  dis¬ 
cussed  in  later  sections,  it  will  become  clear  that  the 
objective  of  extendabil ity  is  also  met. 


1.6  Summary  of  The  Report 

.he  architecture  of  the  Multibase  system  is  expanded 
in  more  detail  in  Section  2.  The  process  of  mapping  each 
LHS  to  a  LS  and  merging  LSs  into  a  GS  is  discussed  in  Sec¬ 
tion  3.  Section  3  also  discusses  the  problem  of  data 
incompatibility  and  inconsistency.  The  method  by  which 
user  queries  are  translated  into  efficient  local  queries 
is  discussed  in  Section  4.  The  mapping  language  ADAPLEX 
and  the  use  of  the  language  in  defining  the  mappings 
between  LSs  and  GS  are  discussed  in  section  5.  Section  6 


i: 
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2.  Run-Time  Query  Processing  Subsystem 

The  architecture  of  the  Multibase  run-time  subsystem 
consists  of: 

1.  a  query  translator, 

2.  a  query  processor, 

3.  a  local  database  interface  (LDI)  for  each  local  DBMS, 

4.  local  DBMSs. 

A  global  query  references  entity  types  and  functions 
defined  in  the  global  schema.  Before  it  can  be  processed, 
it  is  first  translated  into  a  query  referencing  only 
entity  types  and  functions  defined  in  the  local  schemata, 
by  the  query  translator.  In  other  words,  the  query  trans¬ 
lator  translates  a  global  query  over  the  global  schema 
into  a  global  query  over  the  disjoint  union  of  local  sche¬ 
mata.  The  query  processor  decomposes  the  global  query 
over  the  disjoint  union  of  local  schemata  into  individual 
local  queries  over  local  schemata.  The  query  processor 
also  does  query  optimization  and  coordinates  the  execution 
of  local  queries.  The  LDI  translates  local  queries 
received  from  the  query  processor  into  queries  expressed 
in  the  local  DML  and  translates  the  results  of  the  local 
queries  into  a  format  expected  by  the  query  processor. 
These  components  and  their  interrelationships  are  depicted 
in  Figure  2.1. 
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Run-Time  Query  Processing  Subsystem  Figure  2.1 


query  over  global  schema 
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2.1  The  User  Interface 

*he  global  schema  is  expressed  in  the  functional  data 
model  (Shipman],  In  this  data  model,  a  schema  is  composed 
of  entity  types  and  functions  between  entity  types.  Each 
entity  type  contains  a  set  of  entities,  so  functions  map 
entities  into  entities.  Functions  can  be  si.ngle~.valued  or 
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The  functional  data  model  was  selected  because  it 
embodies  the  main  structures  of  both  the  flat  file  data 
models,  such  as  the  relational  model,  and  the  link  struc¬ 
tured  data  models,  such  as  Codasyl.  Entity  types 
correspond  roughly  to  relations  in  the  relational  model  or 
record  types  in  the  Codasyl  model.  Functions  correspond 
to  owner-coupled  sets  in  the  Codasyl  model. 

The  DML  that  we  use  with  the  functional  data  model  is 
called  ADAPLEX  —  the  data  language  DAPLEX  embedded  in  the 
programming  language  ADA.  DAPLEX  is  a  high  level  DML  that 
operates  on  data  in  the  functional  data  model  and  is 
designed  to  be  especially  easy  to  use  by  end  users.  Com¬ 
putationally,  it  is  as  powerful  as  relational  calculus. 
Since  it  is  embedded  in  ADA,  it  acquires  the  additional 
power  of  a  procedural  language.  Details  of  the  language 
are  discussed  in  Section  5. 

2.2  Query  Translator 

*he  query  translator  receives  global  queries 
expressed  in  ADAPLEX  over  the  GS  and  translates  them  into 
queries  expressed  in  an  internal  language  over  the  dis¬ 
joint  union  of  LSs  and  IS. 

To  perform  the  translation,  the  query  translator  must 
use  the  mapping  that  defines  how  entity  types  and  func¬ 
tions  of  the  GS  are  constituted  from  the  entity  types  and 
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functions  of  the  LS  and  the  IS.  The  query  translator  uses 
these  mapping  definitions  to  substitute  global  entity 
types  and  global  functions  in  the  global  query  by  their 
mapping  definitions.  The  substitution  results  in  a  query 
containing  only  entity  types  and  functions  of  the  LSs  and 
the  IS.  Therefore  references  by  the  global  query  to  enti¬ 
ties  in  the  GS  are  now  expressed  as  references  to  the 
actual  entities  at  particular  sites  that  implement  the 
global  GS.  Any  extra  data  needed  from  the  integration 
database  to  resolve  incompatibilities  among  LSs  is  now 
explicitly  referenced  in  the  translated  query. 

The  query  produced  by  the  query  translator  only 
references  data  in  the  LS  and  the  IS.  Thus,  we  can  ima¬ 
gine  that  this  query  is  posed  against  a  database  state 
that  is  the  disjoint  union  of  the  LSs  together  with  the 
IS.  This  disjoint  union  is  a  homogeneous  and  centralized 
view  of  the  distributed  heterogeneous  database. 

The  language  used  for  defining  the  mapping  between 
schemata  must  be  compatible  with  the  global  DML.  Other¬ 
wise,  it  would  be  awkward  to  translate  the  query  from  the 
GS  to  LSs  and  IS  using  conventional  query  modification 
techniques  (Query  modification  composes  the  given  query, 
which  is  a  function  from  GS  states  to  answer  states,  with 
the  mapping  from  LS  and  IS  states  to  GS  states,  to  produce 
a  query  from  LS  and  IS  states  to  answer  states;  (cf. 

[Ston  75].)  Therefore,  we  propose  to  use  the  same  language 

j 

ADAPLEX  as  both  the  query  and  mapping  language.  The  -i 
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process  of  constructing  the  global  schema  from  the  local 
schemata  is  discussed  in  Section  3. 


2.3  Query  Processor 

The  query  processor  translates  a  query  over  the  dis¬ 
joint  union  of  LSs  and  IS  into  a 
query  processing  strategy.  This  strategy  includes:  a  set 
of  queries  each  of  which  is  posed  against  exactly  one  LS 
or  the  IS;  a  set  of  "move"  operations  to  ship  the  results 
of  these  queries  between  the  local  DBMSs  and  the  query 
processor;  and  a  set  of  queries  that  is  executed  locally 
by  the  query  processor  to  integrate  the  results  of  the  LS 
and  IS  queries.  The  main  goal  of  this  translation  is  to 
minimize  the  total  cost  of  evaluating  the  query,  where 
cost  is  measured  by  local  processing  time  and  communica¬ 
tion  volume. 

A  query  processing  strategy  is  produced  in  two  steps. 
First,  the  query  is  translated  into  an  internal  represen¬ 
tation  called  a  query  graph.  Using  this  representation, 
the  query  processor  isolates  those  subqueries  of  the  given 
query  (which  are  essentially  subgraphs  of  the  query  graph) 
that  can  be  entirely  evaluated  at  one  local  DBMS.  So,  the 
result  of  the  first  step  is  the  set  of  single-site 
subqueries  of  the  given  query. 
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The  second  step  is  to  combine  the  single-site  queries 
with  move  operations  and  local  queries  issued  by  the  query 
processor.  Move  operations  serve  two  purposes.  First, 
they  are  used  to  gather  the  results  of  the  single-site 
queries  back  to  the  query  processor.  These  results  can  be 
integrated  by  the  query  processor  by  executing  a  query 
local  to  itself.  The  integrated  results  may  be  the  answer 
to  the  query,  in  which  case  they  are  returned  to  the  user. 
Second,  they  may  be  used  as  input  to  other  single-site 
queries.  In  this  case,  a  move  operation  is  issued  to  ship 
the  data  to  the  local  DBMS  that  needs  it.  The  method  by 
which  single-site  queries,  move  operations,  and  queries 
local  to  the  query  processor  are  sequenced  to  produce  a 
correct  and  efficient  strategy  is  discussed  in  Section  4. 


2.4  Local  Database  Interface  (LDI) 

Local  queries  posed  against  the  LSs  are  sent  by  the 
query  processor  to  the  LDIs  in  an  internal  format.  The 
LDI  translates  these  local  queries  into  programs  in  the 
local  DML  and  programming  language  over  the  local  host 
schema  (LHS) .  This  translation  is  optimized  to  minimize 
the  processing  time  of  the  translated  query.  When  the 
local  DBMS  uses  a  high  level  (i.e.,  set-at- a-time) 
language,  such  as  DAPLEX,  this  translation  is  fairly 
direct.  However,  when  the  local  DBMS  uses  a  low  level 
(i.e.,  record-at-a-time)  language,  such  as  Codasyl  DML 
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embedded  in  COBOL,  this  translation  may  be  quite  complex 
and  may  require  nontrivial  optimization.  Translation 
methods  for  a  file  system  and  Codasyl  language  are 
described  in  Section  4. 

To  do  the  translation,  the  LDI  must  have  information 
about  how  entity  types  and  functions  in  the  LS  are  mapped 
to  objects  in  the  LHS.  These  mappings  are  defined  using 
the  rules  discussed  in  Section  3.1. 


2.5  Initial  Implementation 

In  this  section,  we  have  sketched  the  basic  architec¬ 
ture  of  Multibase.  Details  of  individual  components 
appear  in  later  sections.  These  sections  outline  prelim¬ 
inary  solutions  to  most  of  the  technical  problems  that 
need  to  be  solved  to  build  Multibase.  These  solutions 
will  be  the  basis  of  a  "breadboard"  implementation  to  be 
completed  by  the  end  of  1980. 

We  emphasize,  however,  that  Multibase  is  a  three  year 
research  project  of  which  only  six  months  have  elapsed. 
As  new  and  better  solutions  to  technical  problems  are 
discovered,  they  will  be  incorporated  into  later  versions 
of  the  system.  In  particular,  optimization  (both  local 
and  global)  of  query  processing  and  techniques  for  han¬ 
dling  data  incompatibility  will  be  the  subject  of  future 


research 
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3.  Schema  Integration  Architecture 

"Schema  Integration"  is  the  process  of  defining  a 
global  schema  and  its  mapping  from  the  existing  local 
schemata.  The  general  architecture  of  this  design  process 
is  discussed  in  this  section. 

There  is  one  local  host  schema  (LHS)  for  each  local 
database.  Each  LHS  can  be  a  relational,  a  Codasyl,  or  a 
file  system.  To  merge  these  LHSs  we  must  convert  them 
into  a  common  data  model  first.  Otherwise,  we  would  be 
mixing  relations  from  a  relational  model  with  record  types 
and  set  types  from  a  Codasyl  model.  Thus  the  first  step 
of  schema  integration  is  to  translate  LHSs  into  Local 
Schemata  (LS)  defined  in  the  Functional  Data  Model  of  ADA- 
PLEX. 

The  second  step  is  to  merge  LSs  into  a  GS.  To  do 
this,  an  integration  schema  which  defines  an  integration 
database  is  often  needed.  An  integration  database  con¬ 
tains:  information  about  mapping  between  different  scales 
used  by  different  LSs  for  the  same  entity  type;  statisti¬ 
cal  information  about  imprecise  data;  and  other  informa¬ 
tion  needed  for  reconciling  inconsistency  between  copies 
of  the  same  data  stored  in  different  databases.  The 
integration  schema  and  LSs  are  then  used  to  define  a  glo¬ 
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The  overall  architecture  of  schema  integration  con¬ 
sists  of: 

a)  a  global  schema, 

b)  a  mapping  language, 

c)  local  schemata  (LS)  and  an  integration  schema  (IS) , 

d)  a  mechanized  local-to-host  schema  translator, 

e)  local  host  schemata  (LHS)  and  local  DBMSs. 

These  components  and  their  interrelationships  are  depicted 
in  figure  3.1.  The  local  host  schemata  are  translated 
into  local  schemata  by  the  mechanized  local  host  schema 
translator,  and  local  schemata  and  the  IS  are  mapped  into 
the  GS  by  using  the  mapping  language  facility. 


3.1  Mapping  Between  LHS  and  LS 

Since  a  LHS  can  be  defined  in  the  relational, 
Codasyl,  or  file  model,  how  a  LHS  is  mapped  into  a  LS 
depends  on  the  data  model  used. 


3.1.1  Codasyl  Model 

If  a  LHS  is  defined  in  the  Codasyl  model,  then  it 
consists  of  record  types  and  set  types.  The  functional 
data  model  consists  of  entity  types  and  functions  on 
entity  types.  So,  to  map  the  LHS  into  a  LS  one  simply 
maps  record  types  and  set  types  into  entity  types  and 
functions  respectively. 
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Schema  Integration  Architecture 

users 


Figure  3.1 


Global  Schema 


Mapping 


Language 

i 


Facility 


The  concept  of  record  type  in  the  Codasyl  model  is 
very  similar  to  that  of  entity  type  in  the  functional  data 
model.  A  record  in  the  Codasyl  model  has  a  record  ID,  and 
one  or  several  attributes.  The  record  ID  uniquely  identi¬ 
fies  the  record,  and  the  attributes  describe  properties  of 
the  record.  Similarly,  in  the  functional  data  model,  an 
entity  is  an  object  of  interest,  and  the  functions  defined 
on  the  entity  return  values  which  describe  the  properties 
of  the  entity.  Therefore,  a  record  type  corresponds  to  an 
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entity  type,  and  the  attributes  of  the  record  type 
correspond  to  functions  defined  on  the  entity  type. 

If  an  attribute  of  a  record  type  is  a  key  (in  Codasyl 
terminology,  a  key  is  the  data  item(s)  declared  "NO  DUPLI¬ 
CATE  ALLOWED")  then  the  corresponding  function  must  be  a 
totally  defined  one-to-one  mapping.  If  the  attribute  is  a 
repeating  group  (declared  to  have  multiple  occurrences  in 
a  Codasyl  model) ,  then  the  function  is  a  set-valued  func¬ 
tion. 


A  set  type  in  the  Codasyl  model  is  a  mapping  between 
an  owner  record  type  and  one  or  several  member  record 
types.  A  set  type  maps  an  owner  record  to  a  set  of  member 
records,  or,  conversely,  a  set  type  maps  a  member  record 
to  a  unique  owner  record.  Therefore,  a  set  type  resembles 
a  function  which  maps  an  owner  entity  to  a  set  of  member 
entities,  or,  conversely,  maps  a  member  entity  to  a  unique 
owner  entity. 

In  a  Codasyl  model,  a  set  type  implies  not  only  cer¬ 
tain  semantic  information  but  also  the  existence  of  access 
paths.  For  example  a  set  type  "work-in"  between  "depart¬ 
ment"  and  "employee"  record  types  implies  that  the  employ¬ 
ees  owned  by  a  department  work  in  that  department.  But  it 
also  implies  that  there  is  an  access  path  from  a  depart¬ 
ment  record  to  the  employee  records  owned  by  that  depart¬ 
ment  and  another  access  path  from  each  employee  record  to 
its  own  department  record.  Since  the  LSs  will  be  used  for 
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query  optimization,  we  must  capture  all  this  access  path 
information  in  the  LSs.  Therefore,  for  each  set  type  in  a 
LHS ,  not  only  a  set-valued  function  from  the  owner  entity 
type  to  the  member  entity  type,  but  also  a  single-valued 
function  from  each  of  the  member  entity  types  to  the  owner 
entity  type  must  be  defined  in  the  corresponding  LS. 

In  a  Codasyl  model,  a  record  type  can  be  declared  to 
have  a  "LOCATION  MODE  CALC  USING  KEY".  This  means  that  an 
index  file  is  created  for  the  key,  and  the  record  type  is 
directly  accessible  through  the  indexed  key.  Therefore, 
for  each  record  type  with  "CALC  KEY"  in  the  LHS,  a  system 
set  function  of  which  the  domain  is  the  key  value  and  the 
range  is  the  entity  type  (corresponding  to  the  record 
type)  must  be  defined  in  the  LS.  This  system  set  function 
will  be  used  only  for  query  processing  optimization.  It 
is  not  visible  to  the  database  designer.  Therefore,  it 
can  not  be  incorporated  into  the  global  schema.  This  res¬ 
triction  is  imposed  to  preserve  the  data  independence  of 
the  global  schema. 

For  example,  the  Codasyl  schema  shown  in  Figure  3.2 
is  translated  into  the  schema  in  the  functional  data  model 
shown  in  Figure  3.3.  In  Figure  3.3,  the  inverse  of  a 
function  F  is  denoted  by  "F-inv". 
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A  Codasyl  Schema 


Figure  3.2 
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A  Schema  in  the  Functional  Data  Model  Figure  3.3 


type  shipclass  Is  entity 


classname 
length 
draf  t 
beam 

displacement 
endurance 
consists_of 
£nd  entity ; 


string (1 . 

.24) 

string (1 . 

.6)  ; 

string (1 . 

.2)  ; 

string (1 . 

.3)  ; 

string (1 . 

.5)  ; 

string (1 . 

.3)  ; 

set  of  ship; 

type  system  is  entity 

all-class  :  shipclass 
all-ship  :  ship; 

£nd  entity; 


type  ship 
UIC 
VCN 
name 
type 
flag 
owner 


is  .entity 

string(l.  .6)  ; 
string(l. .5)  ; 
string (1 . .  26 ) 
string(l.  .4)  ; 
string (1 . .  2)  ; 
string (1 . . 2)  ; 
str ing (1 . . 4 ) 


hull 

positions  :  set  of  trackhist; 
consists_of_inv  :  shipclass; 
end  entity ; 


type  trackhist  is  en.ti.ty 
DTG  :  string  (1..10) 

speed  ;  string  (1..3  ) 

latitude  :  string  (1..5) 

longitude  :  string  (1..6) 

course  :  string  (1..3) 

positions_inv  :  ship; 
end  entity; 


3.1.2  Relational  Model 

A  relational  database  schema  consists  of  a  set  of 
relation  definitions.  To  translate  a  relational  LHS  to  a 
functional  LS  we  essentially  map  each  relation  to  an 
entity  type.  A  tuple  of  a  relation  in  a  relational  model 
is  similar  to  an  entity  in  a  functional  data  model.  A 
tuple  is  uniquely  identified  by  its  primary  key  and  has 
one  or  more  attributes,  just  as  an  entity  has  one  or  more 
functional  values.  Therefore,  to  map  a  relational  model 
LHS  into  a  functional  data  model  LS,  for  each  relation  in 
the  LHS  an  entity  type  is  defined  in  the  LS,  and  for  each 
attribute  of  the  relation  a  function  is  defined  on  the 
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corresponding  entity  type.  The  range  of  the  function  is 
the  domain  of  the  attribute.  If  the  attribute  is  a  pri¬ 
mary  key,  then  the  function  must  be  totally  defined  and 
one-to-one.  If  it  is  a  candidate  key,  then  the  function 
can  be  partially  defined,  but  it  must  still  be  one-to-one. 
In  any  case,  due  to  the  relational  format,  the  function 
must  be  single-valued,  not  set-valued.  For  example,  the 
relational  LHS  shown  in  figure  3.4  is  translated  into  the 
functional  data  model  LS  shown  in  figure  3.5. 


A  Relational  Model 


Figure  3.4 
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A  Schema  in  Functional  Data  Model 


Figure  3.5 
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3.1.3  File  Model 


A  key-file  model  consists  of  record  files  and  indexed 
fields  (keys)  in  those  files.  A  record  file  consists  of  a 
set  of  records  of  the  same  type,  which  is  similar  to  the 
concept  of  record  type  in  the  Codasyl  model  or  a  relation 
in  the  relational  model.  To  map  a  key-file  LHS  to  a  func¬ 
tional  data  model  LS,  for  each  record  file  in  LHS  a 
corresponding  entity  type  must  be  defined  in  the  LS,  and 
for  each  field  of  the  record  file  a  function  must  be 
defined  on  the  entity  type.  Since  a  key  supports  an 
access  path  to  the  record  file,  for  each  key  of  a  record 
file,  a  system  function  must  be  defined  whose 


domain  is 
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the  key  field's  entity  type  and  whose  range  is  the  entity 
type  corresponding  to  the  record  file.  This  system  func¬ 
tion  is  not  visible  to  the  database  designer;  it  is  used 
only  for  query  optimization. 


3.2  Integration  of  LSs 

To  integrate  LSs  into  a  global  schema,  the  database 
designer  designs  an  integration  schema  which  defines  an 
integration  database.  He  then  designs  a  global  schema  and 
defines  it  in  terms  of  the  LSs  and  the  Integration  Schema 
by  using  the  view  support  facility. 

An  integration  database  contains  information  needed 
for  merging  entity  types  and  their  functions.  For  example, 
two  entity  types,  El  and  E2,  from  two  schemata  are  shown 
in  figure  3.6.  These  two  entity  types  represent  informa¬ 
tion  about  ships.  There  are  two  functions  defined  on  each 
entity  type;  one  function  returns  the  ship-id  of  a  ship 
and  the  other  returns  the  ship-class  of  the  ship.  The 
ship-class  of  El  and  E2  are  coded  differently.  A  sample  of 
entities  and  their  functional  values  are  also  shown  in 
figure  3.6.  To  merge  El  and  E2  into  a  single  entity  type, 
a  uniform  code  must  be  defined  and  the  two  existing  codes 
must  be  mapped  to  the  new  code.  Definitions  of  the  new 
code  and  the  mapping  function  are  shown  in  figure  3.7,  and 
a  sample  of  the  function  is  shown  in  figure  3.8.  The 


*;  •  -+\. 


The  Basic  Architecture  of  Multibase 
Schema  Integration  Architecture 


Page  -27- 
Section  3 


definitions  of  the  new  code  and  the  function  are  stored  in 
the  integration  database.  A  global  schema  defined  on  the 
two  local  schemata  and  the  integration  schema  is  shown  in 
Figure  3.9. 


Local  Schemata 

type  El  is  entity 

shipidl  :  integer; 
classl  :  codel ; 
end  entity ; 


Figure  3.6 

type  E2  is  entity 

shipid2  :  integer; 
class2  :  code2; 
end  entity; 


El 

shipidl 

classl 

E2 

shipid2 

class2 

ell 

1212 

cl 

e21 

3440 

d2 

el2 

1240 

c3 

e22 

3651 

d3 

el3 

2341 

c5 

e23 

4411 

d4 

Integration  Database  Figure  3.7 

type  code  is  entity 
end  entity? 

Define  a  new  function 

f  :  (codel  union  code  2)  ->  code. 


Sample  of  Function  f  Figure  3.8 
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Global  Schema  and  The  Mapping  rigure  3.9 

typ.e  e  la  entity 

shipid  :  integer; 
class  :  code; 
end  entity; 


As  the  discussion  above  indicates,  integration  of 
local  schemata  which  are  not  disjoint  involves  two  activi¬ 
ties:  merging  of  entity  types  and  merging  of  their  func¬ 
tions.  These  activities  are  discussed  in  Sections  3.3. 
Two  special  problems  relating  to  schema  integration,  the 
creation  of  new  entity  types,  and  the  integration  of 
incompatible  data,  are  discussed  in  Sections  3.4  and  3.5 
respectively. 


3.3  Merging  Entity  Types  and  Functions 

To  merge  two  entity  types,  say  El  and  E2  in  Figure 
3.6,  into  an  entity  type,  say  E  in  Figure  3.9,  the  data¬ 
base  designer  must  first  determine  whether  the  set  of 
entities  of  type  El  is  disjoint  from  the  set  of  entities 
of  type  E2.  If  El  and  E2  are  disjoint,  then  E  is  simply 
the  union  of  El  and  E2.  If  El  and  E2  are  not  disjoint, 
then  the  condition  under  which  two  entities  from  El  and  E2 
respectively  are  identical  must  be  specified.  To  specify 
the  condition  under  which  entities  are  identical,  entities 
of  El  and  E2  must  be  able  to  be  identified  by  their  attri- 
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butes.  Therefore,  for  each  entity  type  to  be  merged,  a 
function  or  combination  of  functions  of  the  entity  type 
must  be  a  primary  key.  Two  entities  from  two  entity  types 
being  merged  can  then  be  specified  to  be  identical  if  and 
only  if  they  have  identical  primary  key  values. 


In  Figure  3.10,  entity  types  El  and  E2  (which  are 
assumed  to  overlap) ,  are  merged  into  an  entity  type  E.  The 
syntax  used  is  a  subset  of  ADAPLEX.  Notice  that  "shipidl" 
and  "shipid2"  are  assumed  to  be  primary  keys  of  El  and  E2 
respectively.  Further,  it  is  assumed  that  an  El  entity 
and  an  E2  entity  are  identical  if  and  only  if  they  have 
the  same  primary  key  values. 


The  Mapping  Definition  of  Entity  Type  E  Figure  3.10 

.type  e  is  entity 
shipid  :  integer; 
class  :  code; 
end  entity; 

JLsul  e.ach  x  in  El  where  not  (shipidl(x)  isin 

shipid2(E2) ) 

loop 

create  new  E(shipid  =>  shipidl(x) 

class  =>  f  (classl (x) ) ) ; 

end  loop; 
for  each  x  in  E2 
loop 

create  new  E(shipid  =>  shipid2(x), 

class  =>  f (class2 (x) ) )  ; 


.end  1-Q.Qp; 
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3.4  Creation  of  a  New  Entity  Type  and  Its  Functions 

Merging  two  entity  types  into  a  single  entity  type  is 
a  special  case  of  creating  a  new  entity  type.  Essen¬ 
tially,  a  new  entity  type  may  be  created  which  is  a  combi¬ 
nation  of  the  existing  entity  types.  However,  this  combi¬ 
nation  does  not  create  new  objects  in  the  database. 
Rather,  it  simply  presents  many  existing  objects  of  dif¬ 
ferent  types  as  objects  of  a  single  type  to  the  global 
schema  users.  Properties  of  the  new  global  entities  are 
simply  those  that  previously  existed  in  the  local  sche¬ 
mata. 


However,  in  some  cases,  a  database  designer  may  want 
to  design  a  more  sophisticated  global  schema  in  which  new 
(virtual)  objects  derive  their  properties  (attributes) 
from  many  dissimilar  existing  objects.  An  example  is  used 
to  illustrate  this  process,  and  general  principles  can  be 
drawn  from  the  example. 

Suppose  a  global  schema  with  two  entity  types,  "sup¬ 
plier"  and  "parts",  is  to  be  designed  from  two  local  sche¬ 
mata  shown  in  Figure  3.11.  The  global  schema  must  capture 
all  the  information  contained  in  both  schemata.  Notice 
that  in  the  second  schema,  "supplier"  and  "parts"  entities 
do  not  exist,  but  their  existence  is  implied  by  the  pres¬ 
ence  of  supplier  numbers  and  part  numbers  :  "sno"  and 
"pno".  To  capture  this  information,  virtual  "supplier"  and 
"parts"  entities  corresponding  to  those  "sno"  and  "pno" 
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must  be  created  in  the  global  schema.  A  definition  of  the 
global  schema  is  shown  in  Figure  3.12.  Notice  that  in  the 
definition  primary  keys  "supplier.no"  and  "parts.no"  are 
used  to  map  the  new  entities  to  existing  entities  in  the 
first  schema  and  the  implied  entities  in  the  second 
schema. 


3.5  Data  Incompatibility 

Several  sources  of  data  imeompatibility  are  discussed 
in  this  section.  The  objective  of  the  discussion  is  to 
show  how  the  proposed  architecture  allows  us  to  incor¬ 
porate  our  present  understanding  of  incompatible  data  into 
the  Multibase.  The  details  of  solutions  to  the  problem 
are  to  be  fully  investigated  later  in  the  project. 

Some  sources  of  data  imprecision  are: 

a.  Scale  difference. 

For  example,  in  one  database  four  values  (cold,  cool, 
warm,  hot)  are  used  to  classify  climates  of  cities, 
while  in  another  database  the  average  temperatures  in 
Farenheit  may  be  recorded. 

b.  Level  of  Abstraction. 

For  example,  in  one  database  "labor  cost"  and 
"material  cost"  may  be  recorded  separately,  while  in 
another  they  are  combined  into  "total  cost".  Another 
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example  is  recording  an  employee's  "average  salary" 
instead  of  his  or  her  "salary  history"  for  the  previ¬ 
ous  five  years. 

c.  Inconsistency  Among  Copies  of  The  Same  Information. 
Certain  information  about  an  entity  may  appear  in 
several  databases,  and  the  values  may  be  different  due 
to  timing,  errors,  obsolescence,  etc. 

There  are  many  other  sources  of  data  incompatibility. 
Data  incompatibility  must  be  resolved  if  different  data¬ 
bases  are  to  be  integrated.  The  architecture  of  schema 
integration  developed  previously  can  be  extended  to  handle 
the  problem. 

Let  El  and  E2  be  two  entity  types,  and  fl  and  f2  be 
functions  defined  on  El  and  E2  respectively.  El  and  E2 
have  been  merged  into  an  entity  type  E,  then  fl  and  f2  can 
be  merged  into  the  function  f  defined  on  E  as  follows, 

f(e)  =  Tl(fl(e))  if  e  in  El-(E1  intersect  E2) 

T2(f2(e))  if  e  in  E2-(E1  intersect  E 2 ) 

g ( fl ( e)  , f 2 (e) )  if  e  in  (El  intersect  E2) 

The  transformations  T1  and  T2  are  typically  used  to 
map  the  ranges  of  fl  and  f2  into  a  common  range  as  dis¬ 
cussed  in  section  3.3.  On  the  other  hand,  the  function  g 
is  used  to  reconcile  any  inconsistencies  between  the 
values  of  fl  and  f2  over  the  same  entity.  Typically,  g 
will  involve  accessing  data  described  in  the  integration 
schema. 
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I 

I 

I 


Two  Local  Schemata  Figure  3.11 

supplierl  partsl 


supplyl 


I 


Local  schema  1 : 

type  supplierl  Is  entity 

sname  :  string; 

sno  :  integer; 

supply ingl  :  set  ht  supplyl; 

end  entity; 


type  partsl  is  entity 

pname  :  string; 

pno  :  integer; 

supplied-by  :  set  ei  supplyl; 

end  .en.ti.ty  ; 


type  supplyl  is  entity 
sno  ;  integer; 

pno  ;  integer; 

end  entity; 


Local  schema  2: 

supply2 


;no 


pno 


type  supply2  is  entity 
sno  ;  integer; 

pno  :  integer; 

end  entity; 


For  example,  in  Figure  3.13,  the  entity  types  LA  and 


ia^iAafaW« 
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A  Global  Schema  Figure  3.12 

type  supplier  is  entity 

sno  :  integer; 

supplying  :  set  of  parts;  - 

.end  entity; 


.type  parts  is  entity 
name:  string; 
no  :  integer; 
end  entity: 


ier  each  x  is  (sno (supplierl)  Union  sno (supply2) ) 
lfiPP 

create  supplier  (sno  =>  x) ; 
end  loop: 

for  each  y  is  (pno(parts<j  Union  pno (supply2) ) 
ioep 

create  parts  (pno  =>  y) ; 
end  loop: 


ier  eaeh  s  is  supplier  leap 

supplying (s)  :+  (p  is  parts  where f f or  some  yl  is  supplyl : 
sno(s)  =  sno(yl)  and  pno(p)=  pno (yl))  or 
(iox  some  y2  is  supply2  : 
sno(s)  =  sno(y2)  and  pno(p)  =  pno(y2)); 
end  loop; 


E5  are  merged  into  the  entity  type  E6  by  using  functions 
IS2  and  IS3  of  the  integration  database.  In  the  Figure, 
the  data  values  of  the  entities  and  functions  arc-  shown  in 
tabular  form.  In  this  example,  Tl  and  T2  transform  the 
climate  of  cities  from  two  different  scales, 
(cold, cool , warm, hot )  and  Farenheit,  into  a  unified  scale 
(temperature  range,  probability)  by  combining  E4  with  IS2 
and  E5  with  IS3.  The  function  g  could  return  all  the 
(temperature  range,  probability)  pairs  from  the  two  data¬ 
bases  without  any  further  processing,  as  shown  in  Figure 
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3.13,  or  it  could  use  some  statistical  technique  to  pro¬ 
cess  sets  of  (Temp  range,  probability)  pairs,  and  return  a 
simpler  but  descriptive  summary  of  those  pairs. 

For  example,  the  function  g  could  return  the  average 
value  and  the  standard  deviation  of  the  distribution 
represented  by  these  pairs;  it  can  make  statistical  esti¬ 
mation  and  return  a  confidence  interval;  or  it  can  do  time 
series  analysis  and  return  information  about  the  spectral 
function.  In  any  case,  it  is  a  research  problem,  and  will 
be  fully  investigated  later  in  the  project. 
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Example  of  Data  Incompatibility 


Figure  3.13 


E4  (of  LSI) 

IS2 

(of 

integration 

database) 

cityl 

climate 

climate 

range  of  temp 

probability 

Boston 

cold 

cold 

0 

-  20  F 

20% 

Norf or k 

cool 

cold 

20 

-  40  F 

40% 

Dallas 

warm 

cold 

40 

-  60  F 

25% 

Miami 

hot 

cold 

60 

-  80  F 

10% 

•  •  • 

•  •  • 

cold 

80 

-  100F 

5% 

cool 

0 

-  20  F 

10% 

cool 

•  •  • 

20 

•  • 

-  40  F 

• 

20% 

•  •  • 

E5  (of 

LS2) 

IS3 

(of 

integration 

database ) 

city2 

mean  temp 

mean  temp 

range  of  temp  probability 

Denver 

52  F 

52  F 

0  -  20  F 

20% 

Chicago 

54  F 

52  F 

20  -  40  F 

35% 

Los  Ang 

•  •  • 

75  F 

•  •  • 

52  F 

•  •  • 

40  -  60  F 

•  •  • 

30% 

•  •  • 

E6  (of  global  schema) 

city 

temp  range 

probability 

Boston  0  -  20  F  20% 

Boston  20  -  40  F  40% 
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4.  Run-Time  Query  Processing  Subsystem 
4.1  Overall  Architecture 

Now  we  will  show  how  the  specifications  developed 
during  schema  integration  are  utilized  to  drive  query  pro¬ 
cessing  over  the  global  schema.  As  we  discussed  in  section 
2,  the  run-time  subsystem  consists  of  a  query  translator 
and  a  query  processor.  Here  we  will  expand  these  tv/o  com¬ 
ponents  in  further  detail. 

A  "Global  Database  Manager"  (GDM)  is  that  part  of  the 
Multibase  System  which  consists  of  the  query  translator, 
and  the  query  processor.  A  query  over  the  global  schema 
is  normally  sent  to  the  nearest  site  which  has  a  Global 
Database  Manager  (GDM) .  There  may  be  one  or  more  GDM  in  a 
Multibase  system.  A  GDM  stores  a  copy  of  global  schema, 
local  schemata,  integration  schema,  and  the  mapping  defin¬ 
itions  among  them.  It  uses  this  information  to  parse, 
translate,  and  decompose  queries  over  global  schema  into 
local  queries  over  local  schemata,  and  coordinates  execu¬ 
tion  of  the  local  queries.  The  structure  of  a  GDM  and  its 
interface  with  local  DBMSs  is  shown  in  Figure  4.1. 

A  query  expressed  in  ADAPLEX  over  the  global  schema 
is  first  parsed  by  the  parser  and  a  parse  tree  is  gen¬ 
erated.  Components  of  the  parse  tree,  which  are  entities 
and  functions  of  the  global  schema,  are  then  replaced  by 
their  corresponding  definitions  which  are  expressed  in 
terms  of  the  local  schemata  LSs.  The  result  is  a  parse 
tree  consisting  of  entities  and  functions  of  the  local 


1 


The  Basic  Architecture  of  Multibase 
Run-Time  Query  Processing  Subsystem 


Page  -38- 
Section  4 


schemata.  The  parser  is  part  of  the  query  translator. 

The  parse  tree  is  then  simplified  to  eliminate  the 
inefficient  boolean  components.  For  example,  the  boolean 
expression  " ( a  >  5 ) or (a<20) "  is  reduced  to  "true",  and 
" ( a > 5 ) and (a<2) "  is  reduced  to  "false".  The  query  simplif¬ 
ier  is  also  part  of  the  query  translator. 

The  parse  tree  is  then  decomposed  by  the  decomposer 
into  subtrees.  Each  subtree  represents  a  local  query 
referencing  only  entities  and  functions  of  a  single  local 
schema. 

The  "ACCESS  PLANNER"  transforms  the  local  queries 
into  "data  movement"  and  "local  processing"  steps. 
Depending  on  the  memory  size  and  processing  power  of  each 
individual  site,  and  the  capacity  of  the  communication 
channels,  the  "ACCESS  PLANNER"  may  move  data  and  distri¬ 
bute  the  computing  load  among  sites,  or  it  may  move  data 
to  a  central  site  which  has  large  memory  and  computing 
power  and  do  most  of  the  processing  there.  In  doing  this 
planning,  the  "ACCESS  PLANNER"  tries  to  produce  steps 
which  minimize  the  cost  of  processing  the  query.  The  mean¬ 
ing  of  "cost"  depends  on  the  individual  systems  being 
integrated.  It  may  mean  the  amount  of  data  moved  between 
sites,  or  the  amount  of  processing  time. 

The  execution  of  the  access  plan  is  coordinated  by 
the  "EXECUTION  STRATEGIST".  It  sequences  the  steps  of  the 
access  plan  and  it  makes  sure  that  the  data  needed  by  a 
step  are  there  before  the  step  is  initiated. 
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The  "EXECUTION  STRATEGIST"  communicates  with  local 
DBMSs  through  the  Local  Database  Interface  (LDI).  The 
LDIs  receive  "data  move"  and  "local  processing"  steps  from 
the  "EXECUTION  STRATEGIST",  translate  these  steps  into 
programs  in  local  query  language  or  Data  Manipulation 
Language  (DML)  ,  or  call  local  routines  to  process  these 
steps,  and  translate  the  results  of  these  steps  into  the 
format  expected  by  the  "EXECUTION  STRATEGIST".  The  LDI 
may  reside  in  a  GDM  if  the  local  site  does  not  have  enough 
memory  or  cpu  power;  otherwise  it  resides  with  the  indivi¬ 
dual  local  DBMS  at  the  local  site. 

The  query  processor  to  be  described  in  this  section 
is  oriented  towards  the  initial  breadboard  system.  It  is 
designed  to  handle  restricted  versions  of  the  user  inter¬ 
face  language  and  view  mapping  language  with  reasonable 
efficiency.  Subsequent  research  is  needed  to  extend  the 
query  processor  to  efficiently  handle  the  unrestricted 
languages. 

Within  the  "Query  Processor",  the  database  is 
modelled  as  a  collection  of  record  types  and  links.  A 
link  L  from  record  type  R  to  record  type  S  is  a  function 
from  records  of  S  to  records  of  R;  S  is  called  the  owner 
record  type  and  R  is  called  the  member  record  type  rela¬ 
tive  to  L.  We  assume  that  if  L  links  R  to  S,  then  L,  R, 
and  S  are  all  stored  at  the  same  site.  We  also  assume 
that  there  is  a  database  schema  describing  the  record 
types  and  links  of  the  database. 
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We  will  sketch  the  Multibase  query  processing  stra¬ 
tegy  in  three  steps.  In  Section  4.2  we  define  the  set  of 
queries  that  can  be  posed.  In  Section  4.3  we  define  the 
set  of  basic  operations  that  Multibase  is  capable  of  exe¬ 
cuting.  In  Section  4.4  we  describe  how  to  translate  a 
query  into  a  sequence  of  basic  operations  that  solve  the 
query.  And,  in  section  4.5  we  describe  how  to  translate  a 
local  query  posed  over  a  Codasyl  local  host  schema  into  a 
program  in  a  low  level  Data  Manipulation  Language. 


4.2  Queries 


A  ouerv  consists 

of  a 

target  list 

and 

a  qualifica- 

tion.  A  target 

lXsJ; 

consists 

of 

a  set  of 

indexed  record  types,,. 

which 

are  terms 

of 

the  form  R.A 

where  R  is  a  record  type  and  A  is  a  field  of  R.  A  qualif¬ 
ication  is  a  conjunction  of  selection  clauses,  join 
clauses,  and  link  clauses.  A  selection  clause  is  a  for¬ 
mula  of  the  form  (R. A  op  k)  where  R.A  is  an  indexed  record 
type,  op  is  one  of  {  =  ,.£,<,>,2.^}  and  k  is  a  constant.  A 
join  clause  is  a  formula  of  the  form  (R.A  =  S.B)  where  R.A 
and  S.B  are  indexed  record  types.  A  link  clause  is  a  for¬ 
mula  of  the  form  (R-L->S)  where  L  is  a  link  from  R  to  S. 

Let  r  and  s  be  records  in  R  and  S  respectively.  We 
say  that  r  satisfies  the  selection  clause  (R.A  op  k)  if 
the  A-value  of  r  is  op-related  to  k  (written  (r.A  op  k)). 
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We  say  that  r  and  s  satisfy  the  join  clause  (R.A  =  S.B)  if 
the  A-value  of  r  equals  the  B-value  of  s  (i.e.,  r.A  = 
s.B) .  And,  we  say  that  r  and  s  satisfy  the  link  clause 
(R-L->S)  if  L  connects  r  and  s(i.e.,  r-L->s) . 

Let  Rl,...,  Rn  be  the  record  types  referenced  by 
qualification  q,  and  let  rl,...,rn  be  records  in  Rl,...,Rn 
respectively.  We  say  that  rl,...,rn  satisfy  the  qualifi¬ 
cation  q  if  rl,...,rn  satisfy  all  of  the  clauses  of  q. 

Let  Q  be  a  query  consisting  of  target  list 
T= (Ril .Ajl , . . . ,Rim. Ajm)  and  qualification  q.  Let 
Rl,...,Rn  be  the  record  types  referenced  in  T  and  q.  The 
answer  to  Q  is  the  set  of  all  records  of  the  form 
( ril .Ajl , . . . , rim. Ajm)  such  that  rl,...,rn  are  in  Rl,...,Rn 
(respectively)  and  rl,...,rn  satisfy  q.  Given  a  database 
Rl,...,Rn  and  a  query  Q,  our  goal  is  to  compute  the  answer 
to  Q  efficiently. 

A  query  graph  QG(N, E)  is  an  undirected  labelled  graph 
that  represents  a  query,  Q.  The  nodes,  N,  of  QG  are  the 
record  types  referenced  in  Q.  Each  node  is  labelled  by 
the  record  type  name  of  the  node,  the  fields  of  the  record 
type  that  appear  in  the  target  list,  and  the  selection 
clauses  of  Q's  qualification  that  reference  the  record 
type.  The  edge  set  E  of  QG  contains  one  edge  (R,S)  for 
each  join  clause  or  link  clause  that  references  R  and  S. 
Each  edge  is  labelled  by  its  corresponding  clause(s). 
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A  query  is  called  natural  if  (a)  join  clauses  are  of 
the  form  (R.A=S.A) ,  that  is,  the  fields  referenced  in  both 
indexed  record  types  in  a  join  clause  have  the  same  name; 
and  (b)  if  A  is  a  field  of  two  record  types  R  and  S,  then 
R.A  and  S.A  are  "connected"  by  a  sequence  of  join  clauses. 
There  is  a  simple  and  efficient  algorithm  that,  given  a 
database  description  and  a  query  Q,  renames  the  fields  of 
the  record  types  where  necessary  to  produce  an  equivalent 
natural  query  Q' ;  Q  and  Q'  are  equivalent  in  the  sense 
that  they  produce  the  same  answer  for  any  database  state 
(up  to  the  renaming  of  fields).  We  will  therefore  assume, 
without  the  loss  of  generality,  that  our  queries  are 
natural.  Given  that  we  deal  only  with  natural  queries, 
the  edge  labels  corresponding  to  join  clauses  are  unneces¬ 
sary.  Also  target  lists  need  only  contain  field  names, 
instead  of  indexed  record  types. 

Given  a  join  clause  (R.A=S.A)  and  a  selection  clause 
(R.A  op  k) ,  we  can  deduce  that  (S.A  op  k) .  We  assume  that 
the  qualification  of  each  query  is  augmented  by  all 
clauses  that  can  be  deduced  in  this  way.  A  simple  and 
efficient  transitive  closure  algorithm  is  sufficient  for 
performing  such  deductions. 
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4.3  Basic  Operations 

There  are  three  types  of  sites  in  the  breadboard  Mul¬ 
tibase:  OSIS,  Codasyl,  and  GDM.  Each  type  of  site  is 
capable  of  executing  a  different  set  of  basic  operations. 
This  section  describes  these  basic  operations. 

1 .  OSIS  Select 

If  record  type  R  is  stored  at  an  OSIS  site  S,  then  the 
only  operation  that  can  be  applied  to  R  at  S  is  a 
selection  of  the  form: 

R[(Al=kl)  and  <A2=k2)  and  ...  and  (An=kn)]. 

The  result  of  the  selection  is  a  record  type  consist¬ 
ing  of  the  set  of  all  records  r  in  R  such  that 
r(Ai}=ki  for  i=l,...,n;  this  result  is  always 
transmitted  to  GDM. 

Selections  are  currently  implemented  in  OSIS,  so 
no  additional  Multibase  software  need  be  written. 

2.  OSIS  Semij.Qijn 

in  principle,  OSIS  select  can  be  generalized  into  OSIS 
semijoin,  by  performing  selections  iteratively.  Let  R 
be  an  OSIS  file  and  S  a  GDM  file,  and  suppose 
Al,...,An  are  fields  of  R  and  S.  Then  the  semijoin  of 
R  by  S  on  Al,...,An,  denoted  R | A1 , . . . , An ] S ,  equals 
{r  in  R  |  (there  exist  s  in  S)(r.Al=s.Al  ...  r.An=s.An 
This  can  be  computed  by  the  following  program. 
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Result:=0; 

for  each  s  In  S 

loop 

kl:=s. Al , . . . ;  kn:=s.An; 

Result:=Result  U  R[(Al=kl)  ...  (An=kn)]; 
loop; 

In  practice,  this  operation  may  place  an  unacceptable 
load  on  the  OSIS  system  and  hence  may  not  be  usable. 

3.  Codasyl  Tree  Queries 

The  basic  operation  that  can  be  performed  at  a  Codasyl 
site  S  is  to  solve  a  natural  tree  query  (defined 
below) ,  returning  the  result  to  the  GDM.  A  natural 
tree  query  Q  at  site  S  has  two  properties:  (1)  All 

record  types  referenced  in  Q  must  be  stored  at  S.  (2) 

Let  Q'  be  Q  minus  its  join  clauses  (i.e.  all  clauses 
of  Q'  are  selections  or  links) ,  and  let  QG'  be  the 
query  graph  of  Q';  then  QG'  must  be  a  tree. 

To  solve  a  tree  query  Q  using  Codasyl  DML ,  one 

essentially  expands  the  cartesian  product  of  the 

record  types  referenced  by  Q  and  evaluates  the  qualif¬ 
ication  on  each  element  of  the  cartesian  product.  We 
describe  how  this  cartesian  product  can  be  systemati¬ 
cally  generated  in  Section  4.5. 

4.  Codasyl  Tree  Semijoins 

The  preceding  operation  can  be  generalized  into  a 
semi join-like  operation.  Let  Q  be  a  Codasyl  tree 
query  and  S  a  GDM  record  type,  and  suppose  Al,...,An 
are  fields  of  S  and  fields  of  record  types  of  Q.  Let 
Q'  have  the  same  qualification  as  Q,  and  the  target 
list  augmented  by  Al,...,An.  Finally,  let  R'  be  the 
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result  of  Q'.  The  semijoin  of  Q  by  S  on  Al,...,An, 
denoted  Q<A1 , . . . , An] ,  equals 

{r'in  R' | (there  exist  s  in  S) ( r * . A2=s. A2)  ...  ( r ' . An=s. An) } . 

This  can  be  computed  as  follows.  Suppose  Al,...,An 
are  fields  of  Rl,...,Rn  respectively  where  Rl,...,Rn 
are  record  types  of  Q.  (Rl,...,Rn  need  not  be  dis¬ 
tinct.)  Augment  the  qualification  of  Q'  by  adding  the 

clauses  (Rl.Al=kl)  ...  (Rn.An=kn).  And  execute  the 
following  program. 

\ 

Result:=  0; 

for  each  s  in  S  loop 

kl : =s. A1 ; . . . ;  kn:=s.An; 

Result:=  Result  U  Q' ; 
end  loop  * 

5.  GDM  Queries 

The  GDM  can  process  any  natural  query  Q  provided  (1) 

all  record  types  referenced  in  Q  are  stored  at  the 

GDM,  and  (2)  Q  contains  no  link  clauses.  Suppose  Q 
references  record  types  Rl,...,Rn.  Q  is  processed  by 
constructing  a  request  to  the  local  DBMS  (the  Datacom- 
puter  for  the  initial  breadboard  system)  of  the  form: 

for  each  rl  in  Rl  where  (selection  clauses  on  Rl) 
for  each  r2  in  R2  where  (selection  clauses  on  R2) 

and  (join  clauses  on  Rl  and  R2) 


iL&I  each  rn  in  Rn  where  (selection  clauses  on  Rn) 

and  (join  clauses  on  Rl  and  Rn) 
and  (join  clauses  on  R2  and  Rn) 


print 


and  (join  clauses  on  Rn-1  and  Rn) . 
(target  list) . 


It  is  important  that  the  "for 


statements  be  in  a 
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"reasonable"  order  for  performance  reasons.  Optimiza¬ 
tion  techniques  developed  by  Wong  for  the  SDD-1  DM 
[Wong]  are  directly  applicable. 


4.4  Query  Decomposition 

lo  solve  a  query  Q,  we  must  decompose  it  into  a 
sequence  of  basic  operations.  Our  basic  strategy  is  to 
find  subqueries  of  Q  that  can  be  entirely  solved  at  OSIS 
and  Codasyl  sites,  move  the  results  of  these  subqueries  to 
GDM,  and  solve  the  remainder  of  the  query  at  GDM . 

To  follow  this  strategy,  we  must  isolate  OSIS  and 
Codasyl  subqueries  of  Q.  OSIS  subqueries  are  easy  to 
find.  We  simply  find  record  types  in  Q  that  are  stored  at 
OSIS  sites.  For  each  such  record  type  R,  we  produce  a 
subquery  consisting  of  the  selection  clauses  on  R. 

Let  QG  be  the  query  graph  of  Q.  To  find  Codasyl 
subqueries,  we  begin  by  deleting  from  QG  all  record  types 
not  stored  at  a  Codasyl  site  and  all  join  clauses.  Each 
connected  component  of  the  resulting  graph  includes  record 
types  and  links  that  are  stored  at  the  same  site,  because 
no  link  can  connect  two  record  types  stored  at  different 
sites  (cf.  Section  4.1) .  If  a  connected  component  is  a 
tree,  then  it  corresponds  to  a  tree  query  and  can  be 
solved  by  the  Codasyl  site.  If  it  has  a  cycle,  then  it 
must  be  further  decomposed  into  two  or  more  tree  queries. 
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(In  the  breadboard  version  of  Multibase,  we  will  only  han¬ 
dle  queries  whose  Codasyl  subqueries  are  tree  queries;  if 
some  Codasyl  subquery  is  cyclic,  the  query  cannot  be  pro¬ 
cessed)  . 

Having  extracted  the  OSIS  and  Codasyl  subqueries,  we 
must  now  choose  an  order  for  these  subqueries  to  be  exe¬ 
cuted.  As  a  first-cut  solution,  we  propose  to  solve  all 
OSIS  and  Codasyl  subqueries  before  processing  the  results 
of  any  of  these  subqueries  at  the  GDM.  This  strategy  will 
be  an  especially  poor  performer  if  an  OSIS  or  Codasyl 
subquery  has  no  selection  clauses.  For  such  cases,  we 
recommend  use  of  OSIS  and  Codasyl  semijoin  operations,  so 
that  the  results  of  some  subqueries  can  be  used  to  reduce 
the  cost  of  other  subqueries.  However,  this  tactic  brings 
us  into  the  realm  of  new  query  optimization  algorithms  and 
will  require  further  research. 


4.5  Processing  Codasyl  Tree  Queries 

Let  Q  be  a  Codasyl  tree  query  and  QG  its  tree.  The 
following  algorithm  compiles  Q  into  a  program  that  solves 
Q.  The  program  contains  statements  of  the  form: 

a)  for  R  in  set(S)  ...  end  ;  where  S  owns  R  via  set  ; 

b)  R : =set_inv ( S)  ;  where  R  owns  S  via  set.  Note  that 
set-inv  is  the  inverse  function  of  set  and  is  always  a 


f  uncti on 
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Algorithm 

1 .  Do  a  pre-order  traversal  of  QG.  The  result  is  a  list 
of  the  nodes  of  QG.  Call  this  list  P. 

2.  Let  R  and  S  be  nodes  of  QG;  with  R  the  parent  of  S. 

Cases 

R  is  the  root  of  QG;  replace  1 2 R ’  by  "for  R"  in  P. 
R  owns  S:  replace  "S"  by  "for  S  in  set(R)"  in  P. 

S  owns  R:  replace  "S"  by  "S=set_inv ( R) "  in  P. 

3.  Push  loop  independent  assignments  up  as  high  as  possi¬ 
ble. 

4.  Add  an  "output  (target  list)"  statement,  add  selec¬ 
tions,  and  joins  as  high  as  possible,  tack  on  enough 
ends  to  balance  the  f ors. 

Example  :  Let  QG= 

R 


1.  Preorder  traversal:  R,S,T,U,V. 

2 .  P :  for  R 

IflX  S  in  LI (R) 

T : =  L2_inv(R) 
for  U  in  L3 (T) 

V : =L4_inv (T) 
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3.  T  and  V  can  be  pushed  up;  add  output 
etc. 


for  R 

T:=  L2_inv(R) 

V:=  L4_inv(T) 
for  S  in  LI ( R) 

for  U  in  L3(T) 

output  (target  list) 
end  end  end 


I 


L 


statement ; 
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5.  The  MULTIBASE  Language 

DAPLEX,  embedded  in  the  programming  language  ADA,  is 
to  be  used  both  as  the  user-interface  language  and  as  the 
mapping  language  in  MULTIBASE.  In  this  section  we  first 
briefly  review  the  salient  features  of  the  functional  data 
model  and  DAPLEX;  we  then  restrict  DAPLEX  to  appropriate 
subsets,  which  will  be  used  as  the  mapping  language  for 
the  initial  breadboard  system;  finally,  we  illustrate  the 
use  of  this  language  for  defining  a  global  schema,  and  for 
mapping  queries  against  a  global  schema  into  queries 
against  the  collection  of  local  databases. 


5.1  The  Functional  Data  Model  and  DAPLEX 

The  basic  constructs  of  the  functional  data  model  are 
the  entity  and  the  function.  Entities  are  intended  to 
represent  real-world  objects,  and  functions  to  represent 
properties  of  these  objects,  or  relationships  among 
objects.  Functions  may  be  single-valued  or  multi-valued. 

More  precisely,  in  every  state  of  the  database,  there 
is  a  universal  set  E  of  abstract  elements.  Each  element 
in  E  is  an  entity  and  is  labelled  with  one  or  more  types. 
Entity  types  are  declared  as  follows: 


type  x  is  .entity 

end  entity; 
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In  any  state  of  the  database,  the  extension  of  an  entity 
type  X  is  the  set  of  all  elements  in  E  of  type  X.  We 
denote  the  extension  of  type  X  also  by  X. 

There  are  some  pre-defined  base  types,  e.g.,  String, 
Integer,  Boolean,  whose  extensions  cannot  be  modified. 
Entities  of  only  these  base  types  can  be  printed  (this 
implies  that  they  have  values.  which  can  be  printed) . 
Initially,  the  set  E  contains  only  these  base  entities. 
We  shall  describe  shortly  how  entites  of  other  types  are 
created  and  added  to  E.  DAPLEX  permits  the  definition  of 
subtypes  (cf.  the  generalization  or  ISA  hierarchy  [Smith 
and  Smith;  Mylopoulos,  et  al]),  e.g.,  subtype  Student  is 
Person.  This  states  that  a  student  entity  is  automati¬ 
cally  also  a  Person  entity;  thus,  the  extension  of  student 
is  a  subset  of  the  extension  of  Person. 

An  entity  type  declaration  includes  the  declaration 
of  all  entity  functions  applicable  to  entities  of  this 
type.  For  example: 

iy&fi  Dept  is  entity 

DeptNo  :  String; 

Emps  :  set  of  Employee; 

£nd  entity; 

Here  we  are  declaring  two  functions;  DeptNo  from  the  set 
of  Dept  entities  into  the  set  of  string  entites,  and  Emps 
from  the  set  of  Dept  entities  into  the  powerset  of 
Employee  entities,  i.e.,  DeptNo:  ->  String,  Emps:  Dept 
->P (Employee) . 
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Functions  can  be  composed  in  the  usual  way.  Thus, 
given  major:  Student  ->  Dept  and  DeptNo:  Dept  ->  String, 
we  can  write  DeptNo (major (s) )  for  any  s  in  Student.  Also, 
to  compose  multivalued  functions  (or  a  single-valued  func¬ 
tion  and  a  multivalued  one) ,  DAPLEX  permits  the  natural 
extension  of  a  function  f  on  a  set  X  to  the  corresponding 
function  (also  called  f)  on  the  powerset  of  X.  Thus, 

f :  X  ->  Y  is  extended  to 

f:  P(X)  ->  P ( Y )  where,  for  X'  C:  X,  f(X')  =  { f ( x )  |  x  in  X' 
Similarly , 

g:  X  ->  P(Y)  is  extended  to 

g:  P(X)  ->  P ( Y )  where,  for  X'  C  X,  g(X')  =  U  g(x) 

x  in  X' 

These  extended  functions  can  then  be  used  exactly  as  the 
original  functions.  Thus,  we  can  write  SocSecNo (Emps (d) ) , 
for  any  d  in  Dept. 

Retrieval  queries,  update  requests,  and  view  defini¬ 
tions  in  DAPLEX  are  formulated  using  statements  and 
expressions.  Statements  include  the  data  definition 
statements,  FOR  loops,  and  the  print,  create,  and  assign¬ 
ment  statements.  (Of  course,  when  DAPLEX  is  embedded  in 
ADA,  the  entire  armory  of  ADA  statements  can  be  used.) 
Expressions,  which  appear  within  statements,  are  actually 
set  definitions,  and  hence  may  involve  entity  variables 
(ranging  over  entites  of  a  specified  type),  functions, 
arithmetic  comparison  operations  (=,/,  etc.),  the  Boolean 
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logical  operators  (AND,  OR,  NOT),  quantifiers,  set  predi¬ 
cates  (includion,  containment,  equality),  and  set  opera¬ 
tors  (U,D  ,  -) .  We  shall  not  attempt  a  complete,  rigorous 
syntactic  or  semantic  description  of  DAPLEX.  Rather,  we 
give  an  illustrative  example  of  schema  definition, 
queries,  and  entity  creation  in  Figure  5.1. 

An  entity  of  a  specified  type  is  created  and  added  to 
the  set  E  by  means  of  the  CREATE  statement  (Figure 
5.1(c)).  This  statement  permits  the  initialization  of 
values  to  some  or  all  of  the  applicable  functions.  These 
values  can  be  updated  later  using  assignment  statements. 
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Figure  5.1 

Example  of  Schema  Definition,  Queries,  and  Entity  Creation 

(a)  schema  definition 
-type  Dept  Is  entity 

DeptName:  String; 

Courses  :  set  of  Course; 
and  entity; 


type  Student  Is  entity 


SNO 

:  String; 

Major 

:  Dept; 

Name 

:  String; 

Courses 

:  set  of  Course; 

end  entity; 

type  Course  is  entity 

CNo 

:  String; 

Title 

:  String; 

Credits 

:  Integer; 

and  entity; 

(b)  Queries 

Oueryl :  (for  each  student  majoring  in 

Classics,  print  the  student's  name) 

lar  aaab  s  Is  Student  where 

DeptName (Maj or ( s ) )  =  'Classics'  loop 
mini.  (Name  (s)  )  ; 
and  loop; 

Query2 ;  (for  each  department  that  has  some  student 
majoring  in  it  and  enrolled  in  a 
6  credit  course  not  offered  by  the  department, 
print  the  name  of  the  department) 

lar  aaab  d  Is  Dept  where 
(for  some  s  Is  Student  ;  d=Major(s)  and 
for  some  c  Is  Course (s) -Courses (d)  :  Credits (c) =6) 

Xo.QP 

print  (DeptName  (d) )  ; 
and  leap; 

(c)  Creation  of  a  new  entity; 

create  Student  (S NO  =>  '1234',  Name  =>  'J.  Doe', 

Major  =>  the  (d  Is  Dept 
where  DeptName (d) ='CS' ) ; 
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5.2  Subset  of  DAPLEX  as  the  Mapping  Language 


By  the  "mapping  language",  we  mean  the  language  in 
which  the  global  schema  is  defined  as  a  view  of  the  local 
schemata,  and  in  which  global  queries  are  mapped  into 
queries  that  can  be  processed  against  the  local  databases. 

Theoretically,  it  is  always  possible  to  map  a  query 
against  a  view  into  an  equivalent  query  against  the  under¬ 
lying  database.  To  see  this,  let  D  be  the  state  of  the 
underlying  database,  and  V  be  the  state  of  a  view  defined 
on  it.  We  model  the  view  definition  as  a  function  DEF 
from  database  states  to  view  states.  Let  q  be  a  query 
against  V.  We  model  q  as  a  function  that  maps  view  states 
into  states  of  the  result.  Then,  there  is  a  unique  func- 


tion  Q  (the 

composition 

of  DEF 

and  q)  that 

makes  the 

diagram  of 

Fig.  5.2 

commute. 

Furthermore, 

if  Q  is 

expressed  as  a  query  in  any  well-defined  language  (i.e.,  a 
language  for  which  operational  semantics  can  be  written) , 
then  it  can  always  be  processed  against  the  underlying 
database,  given  sufficient  computational  power  and  a 
workspace  for  storing  intermediate  results. 

Thus,  one  strategy  for  processing  global  queries  is 
by  view  construction;  materialize  the  global  state  by 
applying  the  global  schema  definition  to  the  local  data¬ 
base  states;  then  execute  the  global  query  against  the 
global  state.  Clearly,  this  is  a  grossly  inefficient 
strategy.  We  would  like  to  use  query  modification 


v*;-  Vv.  ■ 
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Global  Query  Mapping 


Figure  5.2 


DEF 


instead:  use  the  global  schema  definition  to  modify  the 
global  query  into  a  query  against  the  state  of  the  "compo¬ 
site  schema",  i.e.,  the  disjoint  union  of  the  local  data¬ 
bases.  How  easy  it  is  to  perform  this  syntactic  transfor¬ 
mation  depends  strongly  upon  the  choice  of  the  mapping 
language  in  which  the  global  query  and  the  global  schema 
definition  are  expressed.  Even  more  important,  perhaps, 
is  the  problem  of  global  optimization.  The  more  compli¬ 
cated  the  queries  after  modification,  the  harder  it  will 
be  to  find  efficient  distributed  query  processing  stra¬ 
tegies. 

The  question,  therefore,  is  not  one  of  being  able  to 
process  global  queries  at  all,  but  rather  one  of  practi¬ 
cality.  It  is  a  problem  of  tradeoff  between  convenience 
to  users  (i.e.,  very  powerful  view  mechanism)  and 
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efficient  implementation  (i.e.,  fairly  restricted  view 
mechanism) .  Initially,  we  will  start  off  with  a  simple, 
easy- to-implement  view  mechanism.  But  ultimately,  we  will 
explore  the  spectrum  of  options  and  select  the  most  desir¬ 
able  view  mechanism. 

The  restricted  subset  of  DAPLEX  is  described  in  the 
sequel . 

1.  Retrieval  queries: 

for  each  (range  list)  I  where  qualification]  loop 
{print  (target-list)}; 

.end  loop; 

Here,  range-list  is  a  list  of  range  predicates  of  the 
form  "entity_variable  In  range"  where  range  is  an 
entity  type.  Only  one  variable  is  allowed  to  range 
over  each  entity  type.  The  "range-list"  construct  is 
actually  a  minor  departure  from  the  DAPLEX  form  of 
nested  " for  each"  loops.  This  departure  is  a  useful 
abbreviation  in  the  context  of  the  restricted  syntax. 
However,  the  more  general  form  of  nested  loops  will  be 
used  as  the  syntax  is  extended. 

Qualification  is  a  conjunction  of 

i.  selection  clauses  of  the  type  ”f(x)  op  c"  where  f 
is  a  single-valued  function,  x  an  entity  variable, 
c  a  constant,  and  op  is  an  arithmetic  comparison 
operation. 

ii.  join  clauses  of  the  type  ”f(x)  =  g(y)",  where  x 
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and  y  are  entity_variables,  f  and  g  are  single- 
valued-functions  that  have  base  entities  as 
values; 

iii.  link  clauses  of  the  type  x  =  f(y)  where  x,  y  are 
entity_variables  and  f  is  a  single-valued- 
function,  or  x  isin  f(y)  where  f  is  a  multi-valued 
function; 

target-list  is  a  list  of  components  of  the  type 
single-valued-function  :=  f(x)  or 
single-valued  function  :=  c 
[here  x  is  an  entity_variable,  c  a  constant,  and  f 
a  singlevalued  function. 

Figure  5.3  shows  how  the  queries  of  Figure  5.1 
(b)  can  be  expressed  in  this  subset. 

2.  View  definition; 

Changing  the  imperative  print  to  create  results  in  a 
statement  that  defines  the  extension  of  a  view  entity 
type,  and  of  those  single-valued  functions  that  take 
base  values.  For  defining  "links",  i.e.,  functions 
that  take  other  entities  as  values,  the  assignment 
statement  must  be  used.  Thus, 
f(r)  ;+  (variable-range  where  qualification) 
if  f  is  multivalued; 

or  f(r)  :=  the  (variable-range  where  qualification) 
if  f  is  single-valued.  Var i abl e- range  is 


entity_variable  in  range". 
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Example  of  Queries  in  the  Restricted  DAPLEX  Figure  5.3 

(a)  for  each  (s  In  Student,  d  in  Dept) 

where  d  =  Major(s)  and  DeptName(d)  =  'Classics'  loop 
print  (Name (s) ) ; 
nnd  loop; 


(b)  int  each  (d  in  Dept,  s  in  Student,  c  in 
ut.se) 

where  d  =  Major(s)  and  Credits(c)  =  6  and 
c  in  Course(s)  and  not  (c  in  Courses(d)) 
loop  print  (DeptName (d) ) ; 
end  loop: 


Note  that  we  preclude  unnormalized  structures  (i.e. 
records  with  repeating  groups)  from  the  view;  these  must 
be  explicity  defined  as  hierarchies.  Figure  5.4  gives  an 
example  of  a  view  defined  in  this  subset  of  DAPLEX,  and 
shows  a  query  on  the  view  and,  in  modified  form,  '  on  the 
underlying  database. 

The  subset  of  ADAPLEX  that  we  have  just  described 
makes  the  following  simplifications: 

1.  Set  expressions  in  range  predicates  and  qualifications 
have  been  "flattened  out",  and  quantifiers  eliminated. 
This  allows  us  to  utilize  existing  view  algorithms  for 
relational  databases.  Further  research  will  be 
devoted  to  handling  the  novel  aspects  of  view  process¬ 
ing  in  the  DAPLEX  functional  model. 

2.  The  type-subtype  hierarchy  is  not  explicitly  handled. 
This  hierarchy  will  no  doubt  be  useful  in  the  schema 
integration  step.  However,  the  mechanics  of  inter¬ 
preting  queries  against  the  hierarchy  require  further 


research. 
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Example  of  A  View  Mapping  Definition  Figure  5.4 

a)  View  defined  over  the  schema  of  Fig  5.1(a) 
views 

type  enrollment  i_£  entity 
SNo  :  String; 

CNo  :  String; 

Credits  :  Integer; 

offered_by:  VDept; 
major  :  VDept; 

.gn.cl  entity  ; 

extent 

for  each  (s  in  Student,  c  in  Course) 
where  c  in  Course(s)  loop 

create  enrollment  (SNo  =>  SNo  (s) ,  CNo  =>  CNo  (c) , 

Credits  =>  Credits  ( c ) ) ; 

end  lo-op; 

for  each  (d  in  Dept)  loop 

create  VDept  (Dname  =>  DeptName (d) ) ; 
end  Iq.qe; 

for  each  (e  in  enrollment,  c  in  Course, 
d  in  Dept,  s  in  Student)  loop 
of fered_by (e)  =>  the  (vd  in  VDept  where  (CNo (c) =CNo (e ) 
and  DName(vd)  =  DeptName (d)  and  c  in  Course (d)); 
major (e)  =>  the  (vd  in  VDept  where  d=major(s) 

and  SNo (s) =SNo (e )  and  DName (vd) =DeptName (d) ) ; 

end  leap; 
end  extent ; 

b)  Query  on  the  view: 

for  each  (e  in  enrollment,  vd  In  VDept) 
where  d  4  of f ered_by (e)  and  vd  =  Major (e) 
and  credits(e)  =  6  loop 
print  (DName(vd),  SNO(e)); 

end  leep; 

c)  Modified  query  on  underlying  entities: 

i&Jt  each  (s  in  Student,  c  in  Course,  d  in  Dept) 
where  not  (c  in  Courses(d))  and  d=Major(s) 
and  Credits(c)  =  6  and  c  in  Courses(s)  loop 
PRINT  (DeptName (d) , SNo (s ) ) ; 
end  loop: 


type  VDept  is  entity 

DName  :  String; 
end  entity : 
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5.3  Global  Schema  Definition  and  Extension  of  the  Mapping 
Language  Subset 

Our  approach  to  schema  integration  is  to  provide  a 
view  support  facility;  the  database  administrator  can  use 
this  facility  to  define  the  global  schema  as  a  view  of  the 
collection  of  local  schemata.  The  simplest  global  schema 
is  the  disjoint  union  of  the  local  schemata  (i.e.,  when 
the  view  definition  is  the  identity  function);  this  places 
all  of  the  burden  of  integration  on  the  users  querying  the 
database.  Alternatively  the  DBA  can  define  a  "more 
integrated"  view.  Providing  powerful  constructs  (e.g., 
the  generalization  hierarchy)  in  the  mapping  language  will 
make  this  task  easier. 

In  the  last  section,  we  defined  a  subset  of  DAPLEX  to 
be  used  initially  as  the  mapping  language.  This  subset  is 
close  in  power  to  view  mapping  facilities  proposed  or  pro¬ 
vided  in  state-of-the-art  DBMS.  This  is  the  subset  han¬ 
dled  by  our  query  processing  algorithm  described  in  Sec¬ 
tion  4.  We  now  suggest  important  extensions  to  this  sub¬ 
set,  which  we  feel  are  needed  for  defining  the  global 
schema  as  views  of  the  local  schemata  in  the  multibase 
context.  Extending  the  query  processing  algorithm  to  han¬ 
dle  queries  in  this  larger  class  will  require  additional 
research. 

Schema  integration  often  requires  taking  unions  of 
entity  sets  from  the  local  databases.  Thus,  range 
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declarations  will  be  extended  to  permit  set-algebraic  com¬ 
binations  of  entity  types  and  also  functions  applied  to 
entity  types.  Similarly,  the  target  list  of  a  query  or 
view  definition  statement  will  be  extended  to  permit  set 
operations. 

5.4  Translating  the  DAPLEX  Subset  into  Internal  Form 

This  section  relates  the  subset  of  DAPLEX  described 
in  Section  5.2  to  the  input  language  of  the  query  proces¬ 
sor  described  in  Section  4.2. 

Given  query  q  in  this  subset  of  DAPLEX,  transform  it 
as  follows: 

1.  Replace  each  occurrence  of  f(x),  where  x  is  an  entity 
variable  with  range  X  and  f  a  single-valued  function 
with  base  values,  by  X.f 

2.  Replace  every  link  clause  of  the  type  "x=f(y)M  or  "x 

in  f(y)"  by  Y-f->X,  where  Y  is  the  range  of  y  and  X 

the  range  of  x. 

3.  Replace  every  join  clause  of  the  type  ”f(x)=g(y)", 

where  f:  X  ->  Z,  g:  Y  ->  Z,  Y  is  the  range  of  y,  X  is 

the  range  of  x,  and  Z  is  not  a  base  entity  type,  by 

X- f — > Z  and  Y-g->Z. 
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6.  Summary 

This  report  describes  the  architecture  of  the  Multi¬ 
base  system.  Details  of  the  components  of  the  architec¬ 
ture  to  be  implemented  in  the  initial  breadboard  version 
are  also  described.  Although  additional  research  is 
required  to  fill  in  the  details  of  optimization  and  incom¬ 
patible  data  handling,  the  architecture  already  contains 
several  innovative  ideas  in  integrating  distributed 
heterogeneous  databases.  These  include  : 

i.  the  idea  of  using  an  integration  database  to  resolve 
data  incompatibility; 

ii.  the  idea  of  using  a  mapping  language  to  uniformly 
define  the  global  schema  in  terms  of  the  local  sche¬ 
mata  and  the  integration  schema; 

iii.  and  the  idea  of  using  query  modification  and  query 
graph  decomposition  to  transform  a  global  query  into 
local  queries  and  queries  over  the  integration  data¬ 
base  . 
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